SWE Bench AI News List | Blockchain.News

List of AI News about SWE Bench

Time Details
2026-06-11
10:38
Claude Fable 5 Hits 80.3% on SWE-bench Pro

According to AINewsOfficial_, Claude Fable 5 posts 80.3% on SWE-bench Pro and adds 1M context with 128k output for multi-day autonomy, surpassing rivals.

Source
2026-05-19
08:04
Claude Opus 4.7 Regression Sparks Dev Backlash

According to @godofprompt, Opus 4.7 ignores project instructions and skips MCP configs; Anthropic acknowledged regressions versus 4.6 despite higher benchmarks.

Source
2026-05-11
08:38
Kimi K2.6 Challenges Claude at 1/6 the Price

According to @_avichawla, Kimi K2.6 matches Claude’s chat, code, and cowork capabilities at one-sixth the price, ranks #1 on OpenRouter, and scores 58.6 on SWE-Bench Pro.

Source
2026-05-09
22:15
Claude Opus 4.7 Boosts SWE-bench to 87.6%

According to @godofprompt, Claude Opus 4.7 follows instructions literally, lifts SWE-bench to 87.6% from 80.8%, and breaks 4.6-tuned prompts.

Source
2026-04-09
18:28
Claude Sonnet Plus Opus Advisor Boosts SWE-bench Multilingual by 2.7 Points at 11.9% Lower Cost — Latest Evaluation Analysis

According to @claudeai on Twitter, Sonnet paired with an Opus advisor achieved a 2.7 percentage point higher score on SWE-bench Multilingual than Sonnet alone while reducing per-task cost by 11.9%. As reported by the Claude account post, this advisor-enhanced workflow indicates measurable quality gains and cost efficiency in multilingual software engineering benchmarks. For AI product teams, the data suggests a practical orchestration strategy: route primary reasoning to Sonnet and use Opus selectively for guidance to improve pass rates and lower run-time spending. According to the tweet, these results come from evals on SWE-bench Multilingual, highlighting a repeatable method for cost-aware performance optimization in LLM-based coding assistants.

Source
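The orchestration idea in the item above (a primary model does the work, a stronger model is consulted only when needed) can be sketched as follows. This is a minimal illustration, not the workflow used in the reported evaluation: the `Attempt` type, the `confident` flag, the escalation rule, and both stub functions are hypothetical stand-ins, and no real model API is called.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Attempt:
    answer: str
    confident: bool  # hypothetical signal; the source does not say how escalation is decided


# Hypothetical callables standing in for model calls (not a real API).
Primary = Callable[[str, Optional[str]], Attempt]  # (task, optional hint) -> attempt
Advisor = Callable[[str, str], str]                # (task, draft answer) -> hint


def solve_with_advisor(task: str, primary: Primary, advisor: Advisor) -> Tuple[str, int]:
    """Return (answer, number_of_advisor_calls).

    The primary model always answers first; the advisor is consulted
    only when the first attempt is not confident.
    """
    first = primary(task, None)
    if first.confident:
        return first.answer, 0
    hint = advisor(task, first.answer)
    second = primary(task, hint)
    return second.answer, 1


# Stub models so the sketch runs without any external service.
def stub_primary(task: str, hint: Optional[str]) -> Attempt:
    if hint is None:
        return Attempt(answer=f"draft:{task}", confident=task.startswith("easy"))
    return Attempt(answer=f"revised:{task}:{hint}", confident=True)


def stub_advisor(task: str, draft: str) -> str:
    return "check-edge-cases"


easy_result = solve_with_advisor("easy-1", stub_primary, stub_advisor)
hard_result = solve_with_advisor("hard-1", stub_primary, stub_advisor)
```

In this sketch the advisor is called at most once per task, which is one plausible way selective use of the larger model could keep per-task cost down.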
2026-02-27
12:10
MiniMax M2.5 Beats Opus 4.6 on SWE-Bench Verified with 80.2% Score, 3x Faster, $1 per Hour: AI Coding Benchmark Analysis

According to God of Prompt on X (Twitter), MiniMax M2.5 surpassed Opus 4.6 on the SWE-Bench Verified benchmark with an 80.2% score, delivers roughly 3x faster execution, and is offered at a flat $1 per hour, while using only 10B activated parameters, positioning it as the smallest Tier-1 model for coding tasks. As reported by the same source, these metrics imply lower latency and significantly reduced inference cost, enabling 24/7 autonomous coding agents and continuous integration bots at practical budgets. According to the post, the combination of high benchmark accuracy and small active parameter count suggests strong efficiency-per-dollar, which can improve ROI for software teams deploying code assistants, test repair bots, and maintenance agents in production pipelines.

Source
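To put the flat $1 per hour figure from the MiniMax M2.5 item in context, the arithmetic for an always-on agent can be worked out directly. Only the hourly rate comes from the post; the 30-day month, the agent count, and the function name are assumptions made for illustration.

```python
HOURLY_RATE_USD = 1.0  # flat rate quoted in the post


def always_on_cost(days: int, agents: int = 1, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost in USD of running `agents` instances 24 hours a day for `days` days."""
    return hourly_rate * 24 * days * agents


daily_cost = always_on_cost(1)       # one agent, one day
monthly_cost = always_on_cost(30)    # one agent, assumed 30-day month
```

Under these assumptions a single agent running around the clock comes to $24 per day, or $720 over a 30-day month.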